Abstract

Transcription factors (TFs) are proteins that regulate the transcription of genetic information
from DNA to messenger RNA by binding to a specific DNA sequence. Nucleic acid-protein 
interactions are crucial in regulating transcription in biological systems. This work presents a 
quick and convenient method for constructing tight-binding models and offers physical insights 
into the electronic structure properties of transcription factor complexes and DNA motifs. The 
tight binding Hamiltonian parameters are generated using the random forest regression algorithm, 
which reproduces the given ab-initio level calculations with reasonable accuracy. We present a 
library of residue-level parameters derived from extensive electronic structure calculations over 
various possible combinations of nucleobases and amino acid side chains from high-quality DNA- 
protein complex structures. As an example, our approach can reasonably generate the subtle 
electronic structure details for the orthologous transcription factors human AP-1 and Epstein-Barr 
virus Zta within a few seconds on a laptop. This method potentially enhances our understanding of 
the electronic structure variations of gene-protein interaction complexes, even those involving 
dozens of proteins and genes. We hope this study offers a powerful tool for analyzing transcription
regulation mechanisms at an electronic structural level.


Introduction 

Protein-DNA interactions play a crucial role in various biological processes, such as gene 
regulation, transcription, DNA replication, repair, and packaging.(1—4) For decades, the quest to 
understand the intricate relationships between DNA and proteins has been at the heart of 
biological research.(5—10) These nucleic acid-protein interactions usually occur in two ways: non- 
specifically, such as the interaction between histones and DNA, and through highly selective, 
sequence-specific binding, as seen in transcription factors. This distinction is essential for 
numerous biological functions, ranging from gene regulation to DNA repair.(11) Eukaryotic DNA 
is packaged into nucleosomes (Figure 1).(12—15) The nucleosome core particle (NCP) is the 
fundamental unit of DNA packing in eukaryotic cells. It consists of an octamer of histone proteins 
around which approximately 150 base pairs of DNA are bound.(16-18)
Transcription factors (TFs) act as mediators of genetic information, directing the complex process
of transcription, in which DNA is transcribed into RNA, a precursor to protein synthesis.(6-10, 19-21)

Figure 1. The hierarchical structure of chromosome organization with emphasis on
transcriptional regulation, starting from the chromosome level, through chromatin and the
nucleosome core particle, to the DNA helix. The atomic-resolution structures of the NCP and TF
are also given.

The activator protein-1 (AP-1) is a regulatory element that is present in many promoter and 
enhancer regions. AP-1 plays a crucial role in regulating gene transcription across various 
biological functions, highlighting its versatility in cellular biology.(22-25) It is characterized
by a highly conserved DNA-binding domain that contains an N-x7-R/K sequence
and a basic leucine zipper (bZip) domain.(26-32) The relatively poorly conserved leucine zipper
region is characterized by a leucine at every seventh position and by hydrophobic
residues.(28, 33, 34) AP-1 proteins are a versatile family of dimeric transcription factors. The Jun
protein is a member of this family and can form homodimers or heterodimers
with other proteins. The c-Jun protein promotes cell cycle progression by repressing the p53
tumor suppressor and activating cyclin D1. This reduces the influence of the cyclin-dependent 
kinase inhibitor (CDKI) p21, facilitating the G1 to S phase transition.(35—38) 

Exploring the impact of electron injection on DNA-binding proteins is important in various 
research fields. Ultrafast electron transfer occurs during the recognition of various DNA 
sequences by a DNA-binding protein with distinct dynamic conformations.(39-44) DNA damage 
and repair mechanisms involve electron transport. For instance, positive charge transfer can 
promote oxidative damage to guanine in DNA, which may be related to the presence of mutation 
sites in the genome.(45—54) DNA transcription factors such as SoxR and p53, which are equipped 
with redox-active groups, use DNA charge transport as a redox sensing mechanism.(55—58) The 
DNA-mediated charge transport might enable signaling between the [4Fe4S] clusters in the human
DNA primase, polymerase α, and other replication and repair high-potential [4Fe4S] proteins.(59-63)
This DNA charge chemistry serves as both a sensing method and a monitor of DNA integrity,
which is sensitive to base stacking perturbations caused by mismatches or DNA damage. 

Quantum chemistry provides chemists with critical insight into the electronic structure 
behavior of DNA or protein molecules, but its extensive computational requirements limit the 
scope and variety of systems that can be effectively analyzed.(64-68) The tight-binding (TB)
method offers a more practical alternative for describing the electronic Hamiltonian using smaller
and sparser matrices.(69-74) The TB model was first applied in materials science and
solid-state physics, and has since been applied to molecular clusters and biomolecular
systems.(75-81) Traditionally, TB Hamiltonians have relied on empirical or semi-empirical
parameters, which raises concerns about their accuracy and general applicability.(82-89) A few
studies have sought to improve the accuracy and dependability of TB models by grounding them in
first-principles calculations.(90-92)

The Protein Data Bank (PDB) has provided a continuous influx of high-resolution structural 
data, which has significantly advanced our understanding of protein-DNA interactions.(93—96) 
The increasing number of high-quality experimental protein and DNA structures, including those
obtained through X-ray, NMR, and cryo-EM techniques, has provided opportunities to improve
our TB parameters for biological systems. As previously proposed, it is possible to derive TB
parameters for millions or even billions of molecular fragments, which represent most occurrences 
in protein and DNA databases(92, 97). Integrating structural insights, especially regarding residue 
preferences in protein-DNA interactions, is essential for understanding charge transfer 
mechanisms. Although accuracy is improved, constructing the Hamiltonian is time-consuming due 
to the cost of ab initio calculations and the projection step. Furthermore, the resulting ab initio TB 
Hamiltonian is not transferable to new structural configurations, which limits its usefulness for 
electronic structure simulations. Machine learning algorithms are now widely used in computational
chemistry(98-106) to predict interaction energies, molecular forces(107,
108), electron densities(109), density functionals(110) and various molecular response
properties(111-114). A machine learning algorithm can also be used to predict an accurate TB
Hamiltonian for unseen structures during atomic structure explorations. A machine
learning method for TB Hamiltonian parameterization is therefore desirable.

In this work, we investigate DNA-protein interactions in transcriptional regulation with a
focus on transcription factors, which regulate the transcription of genetic information from DNA
to messenger RNA by binding to a specific DNA sequence. A comprehensive library of residue-level
tight binding parameters is constructed from detailed electronic structure calculations. The
library covers millions of nucleic base and amino acid side-chain combinations extracted from 
high-quality DNA-protein complex structures. TB Hamiltonian parameters derived from ab-initio 
calculations are accurately generated using a random forest regression algorithm. Despite its
simplifications, direct diagonalization of the TB Hamiltonian can generate various electronic
structure properties of DNA-protein complexes. Our approach quickly reproduces the electronic
structure details of orthologous transcription factors, such as human AP-1 and Epstein-Barr virus
Zta(115, 116), in seconds using a laptop. We anticipate that our study will serve as a powerful tool
for analyzing transcription regulation mechanisms at an electronic structural level. This
methodology also opens up possibilities for comprehending the electronic structure variations observed
in millions of protein-gene complexes, or in complexes involving dozens of proteins and genes, in
big data scenarios.


Methods and Computational Details
Construction of the Nucleobase-Amino Acid Library 

The DNA-protein complexes contain only the twenty L-amino acids and four 
deoxynucleotides, which are generally distinguished by their different side chain structures and 
chemical compositions (Figure 2). DNA-backbone interactions are the most numerous and 
contribute to the stability of the DNA-protein complex. In contrast, side-chain interactions of the 
protein are fewer but confer specificity by recognizing the unique features of the DNA sequence. 
The TB parameter library currently includes collections of all possible combinations of amino 
acids and nucleobases, specifically the amino acid/amino acid (AA), base/base (BB), and amino 
acid/base (AB) interaction patterns. Our previous work(92, 97, 117) has thoroughly studied the 
AA and BB conformers, so this study will focus solely on the AB conformers. Note that the BB 
conformers in previous work were generated from customized DNA models using packages such 
as x3DNA(94). In this work, we have updated the BB conformers based on experimental DNA-protein
structures. The procedure to extract each conformer from the available three-dimensional
DNA-binding protein structures follows the work of Singh and Thornton(118). This library
comprises around 1.2 million conformers that cover a broad range of nucleic acid sequences and 
protein families, ensuring representation across different binding modes. The initial structures in 
the library only contain the coordinates of the heavy atoms. The missing hydrogen atoms were 
added using the tleap module in the AmberTools package(119). Three protonation states were
calculated for histidine, and two possible protonation states were considered for other acidic and
basic amino acids.

Figure 2: Illustration of one of the studied nucleobase-amino acid systems (PDB ID: 2H7H). (a)
Depiction of a nucleotide, with the phosphate group linked to a sugar ring, which in turn is bonded to a
base. Adjacent is the general structure of an amino acid, with its variable side chain represented by
"R" in a dashed outline. (b) Spatial distribution patterns of the interactions between the cytosine
base (CYT) from the nucleotide and the glutamine (GLU) side chain. The clusters highlight
various conformers.


The Data Driven Tight Binding Model for Biomolecules 

The tight binding model is a robust framework for studying the electronic properties of large 
and intricate molecular systems. The foundational principles of the tight binding model for
molecular systems, including the derivation process, have been detailed in our previous
publications(92, 97, 117, 120) and in contributions by others(121, 122). Here, we only describe our
methodologies for calculating on-site energies, charge transfer couplings, and the Löwdin 
transformation in our current research. 

Biomolecules are composed of repeated structural units, such as amino acids for proteins and 
nucleotides for DNA. In the tight-binding approximation, electrons have limited interactions with
non-neighboring sites. The formulas for on-site energy and transfer integral are provided below:

The summation runs over all possible sites L. However, only the neighboring sites need to be
considered in the TB approximation. Here, ε represents the on-site energy, t represents the
transfer integral between sites, and φn refers to the molecular orbital of structural unit n. Therefore,
the on-site energy for site n only requires the potential information of site n and its closest
neighboring sites C. The formula for on-site energy can be simplified as follows:

According to Equation 3, the on-site energy is not solely determined by the orbital energy of
site n; it also includes contributions from adjacent sites, particularly the first set of nearest
neighbors, denoted as C. The model can take into account the impact of neighboring residues on
the on-site energy.

The transfer integral describes the ability to perform charge transfer among neighboring sites,
while the on-site energy describes the ability to remove or inject an electron at a specific site. The
transfer integral only requires the potentials of sites n and n+1, that is

In this work, we utilize the Löwdin method to minimize orbital overlap, as the tight binding
model corresponds to an orthogonal basis. This enables us to obtain the effective transfer
integral.

Equation 5 defines s as the orbital overlap integral between sites. The effect of this transformation
on the on-site energy is minor and can be safely ignored if necessary. The TB parameters
have been extensively studied for pure DNA complexes and protein complexes in previous
work(92, 123).
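
For orientation, the quantities defined above can be written in a common textbook form. This is only a sketch, assuming one orbital per site and a Hamiltonian written as a kinetic term plus a sum of site potentials. The symbols T and V_L and the primed transfer integral are notation introduced here for illustration, and the expressions are not guaranteed to match the numbered equations of this work term by term.

```latex
\begin{align*}
% on-site energy: full sum over sites L, then restricted to site n and its nearest neighbors C
\varepsilon_n &= \Big\langle \phi_n \Big|\, T + \sum_{L} V_L \,\Big| \phi_n \Big\rangle
  \;\approx\; \Big\langle \phi_n \Big|\, T + V_n + \sum_{c \in C} V_c \,\Big| \phi_n \Big\rangle \\
% transfer integral: only the potentials of sites n and n+1 are retained
t_{n,n+1} &= \Big\langle \phi_n \Big|\, T + \sum_{L} V_L \,\Big| \phi_{n+1} \Big\rangle
  \;\approx\; \big\langle \phi_n \big|\, T + V_n + V_{n+1} \,\big| \phi_{n+1} \big\rangle \\
% overlap between neighboring site orbitals
s &= \langle \phi_n | \phi_{n+1} \rangle \\
% two-site Loewdin (symmetric) orthogonalization of the transfer integral
t'_{n,n+1} &= \frac{t_{n,n+1} - \tfrac{1}{2}\,(\varepsilon_n + \varepsilon_{n+1})\, s}{1 - s^{2}}
\end{align*}
```

For small s the denominator is close to one and the correction to the on-site energies is of second order in s, which is consistent with the statement that the transformation has only a minor effect on the on-site energy.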

In the framework of the tight binding Hamiltonian, the on-site energy and transfer integrals
are characterized as the diagonal and off-diagonal matrix elements, respectively. Diagonal
elements correspond to the on-site energy for a given orbital or site, which signifies the energy
level of an electron when it is localized at that site. Conversely, off-diagonal elements quantify the
transfer integral, indicative of the probability of an electron’s transition between sites, which is a
measure of the charge transfer couplings within the molecular system.
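
The matrix structure described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the BioTinter implementation: the function name `build_tb_hamiltonian`, the three-site chain, and all numerical values are assumptions chosen for the example, and the matrix is taken to be real and symmetric.

```python
import numpy as np


def build_tb_hamiltonian(onsite, couplings):
    """Assemble a real symmetric tight-binding Hamiltonian (hypothetical helper).

    onsite    : sequence of on-site energies in eV, one per site (diagonal).
    couplings : dict mapping (i, j) site-index pairs to transfer integrals
                in eV (off-diagonal); each value is mirrored to (j, i).
    """
    n = len(onsite)
    h = np.zeros((n, n))
    h[np.diag_indices(n)] = onsite
    for (i, j), t in couplings.items():
        h[i, j] = t
        h[j, i] = t
    return h


# Illustrative three-site chain; the numbers are made up, not library values.
onsite = [-5.0, -5.4, -6.1]
couplings = {(0, 1): 0.05, (1, 2): 0.02}

H = build_tb_hamiltonian(onsite, couplings)

# Direct diagonalization: eigenvalues are returned in ascending order and the
# columns of `coeffs` are the corresponding eigenvectors.
energies, coeffs = np.linalg.eigh(H)
```

Sites that are not listed in `couplings` are simply left uncoupled, which mirrors the nearest-neighbor truncation of the tight-binding approximation.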

Another practical difficulty is the inefficiency in constructing the TB Hamiltonian from ab 
initio calculations. Here, the random forest (RF) regression is utilized to predict TB parameters 
within the BioTinter-1m framework. The RF regression model is employed as a multi-input and 
multi-output framework(124—129), enabling the simultaneous prediction of all TB parameters. 
This method constructs an ensemble of decision trees from varied segments of the training data, 
enhancing model diversity and robustness. Each decision tree’s construction is guided by random 
subsets of features, enabling nuanced learning from the dataset. The RF model averages 
predictions across all trees to estimate molecular descriptors, as implemented in the scikit-learn 
module(130) in Python. The ensemble of 150 trees balances computational efficiency with 
predictive accuracy. 
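
A multi-input, multi-output random forest of this kind can be sketched with scikit-learn as follows. The descriptors and targets below are synthetic stand-ins generated at random, not data from the TB parameters library; only the ensemble size of 150 trees and the 8:2 training/test ratio are taken from the text, and every other setting is left at an assumed default.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 500 "conformers", 12 descriptor features, and
# 3 targets (imagine two on-site energies and one coupling per conformer).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))
W = rng.normal(size=(12, 3))
Y = X @ W + 0.05 * rng.normal(size=(500, 3))

# 80/20 split of the conformers into training and test sets.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# A single forest predicts all targets at once; each tree is grown on a
# bootstrap sample and the forest averages the tree predictions.
model = RandomForestRegressor(n_estimators=150, random_state=0)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# Pearson correlation between reference and predicted values, per target.
corr = [np.corrcoef(Y_test[:, k], Y_pred[:, k])[0, 1] for k in range(Y.shape[1])]
```

Because the targets are fitted jointly, one call to `predict` returns a full row of parameters for each conformer.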

Although various machine learning techniques were explored, including deep learning
methods(131-136), the performance of deep neural networks did not
surpass that of the RF model. The limited success of deep neural
networks in our studies can likely be attributed to insufficient data in the training set. Although our library
contains millions of biomolecular residues, only a few hundred or thousand conformers are
available for each type of AA, BB, or AB combination. In our preliminary evaluations, the deep neural
network model, implemented using the PyTorch framework(137), gave a correlation
coefficient below 0.92, which did not meet our benchmark criteria for inclusion in this study.
In contrast, the RF model consistently achieved correlation
coefficients of 0.95 or above, as detailed in Table S1. We hypothesize that expanding our dataset
by a factor of 100 to 1000 might significantly enhance the
predictive ability of deep neural networks and thereby offer deeper insights into the
variability of electronic structures in biomolecular systems.

After constructing the TB Hamiltonian, we can solve the well-known eigenvalue equation
(HC=EC) directly for electronic structure calculations of any biomolecule. The electron-ion
dynamics can also be solved within the TB framework. These methods are implemented in our
in-house code BioTinter (Tight-binding model for Biomolecular interactions). Because this code
carries a TB parameters library of 1.2 million conformers, we also refer to it as BioTinter-1m.
The workflow of BioTinter is shown in Scheme 1.

Scheme 1. The workflow and code structure of the BioTinter package used in this work.


The BioTinter framework employs a layered architecture to integrate TB parameters into 
quantum chemistry workflows, significantly enhancing the computational efficiency and accuracy 
of molecular simulations involving DNA-protein complexes. At its core, the Database (DB) layer 
hosts an extensive library of pre-calculated TB parameters. Absent parameters trigger the
Quantum Mechanics (QM) layer, which computes the requisite parameters via interfaces with
Orca(138) and Gaussian(139). This process is augmented by
the bioTB module, as detailed in our preceding publications(92, 97). The Machine Learning (ML)
layer predicts TB parameters for novel conformers, enabling the construction of the TB
Hamiltonian for simulations. Initial structural data for simulations are sourced from the Protein
Data Bank (PDB), MD trajectories, or tools like x3DNA(94) and AlphaFold(140). BioTinter-1m
prioritizes a balance between speed and accuracy, resorting to on-the-fly QM calculations when
necessary. This on-the-fly module ensures that even with a vast database, the system remains
responsive and accurate. The upcoming public release of BioTinter-10b may reduce reliance on this
on-the-fly module, as the conformer library is expected to expand to ten billion entries along with a deep
neural network model.
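
The interplay of the three layers can be illustrated with a small dispatch function. This is a hypothetical sketch and not BioTinter code: the function name, the conformer keys, the callables standing in for the ML and QM layers, and in particular the order DB, then ML, then QM are assumptions made for the example.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

Params = Tuple[float, ...]


def get_tb_parameters(
    key: Hashable,
    descriptor: Sequence[float],
    database: Dict[Hashable, Params],
    ml_predict: Callable[[Sequence[float]], Params],
    ml_is_trusted: Callable[[Sequence[float]], bool],
    qm_compute: Callable[[Hashable], Params],
) -> Tuple[Params, str]:
    """Return TB parameters and the name of the layer that supplied them."""
    if key in database:              # DB layer: pre-calculated entry
        return database[key], "DB"
    if ml_is_trusted(descriptor):    # ML layer: prediction for a new conformer
        return ml_predict(descriptor), "ML"
    params = qm_compute(key)         # QM layer: on-the-fly fallback
    database[key] = params           # cache so the next lookup hits the DB
    return params, "QM"


# Toy stand-ins for the three layers; keys and numbers are made up.
db = {"CYT-GLU-001": (-6.2, -7.1, 0.03)}
predict = lambda d: (-6.0, -7.0, 0.02)
trusted = lambda d: sum(d) < 10.0
run_qm = lambda k: (-5.9, -6.8, 0.01)

first = get_tb_parameters("CYT-GLU-001", [1.0, 2.0], db, predict, trusted, run_qm)
second = get_tb_parameters("ADE-TRP-007", [1.0, 2.0], db, predict, trusted, run_qm)
third = get_tb_parameters("GUA-ARG-042", [20.0, 5.0], db, predict, trusted, run_qm)
fourth = get_tb_parameters("GUA-ARG-042", [20.0, 5.0], db, predict, trusted, run_qm)
```

Caching the QM result in the database means that the expensive fallback runs at most once per conformer, so repeated requests stay fast.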

Simulation Details 

In order to construct the TB parameters library, the positions of hydrogen atoms were 
optimized for each dimer using B3LYP/6-31G(d) calculations. We kept the coordinates of the 
heavy atoms fixed during the optimization process. The on-site energies and charge transfer
couplings for each dimer are derived at the HF/6-31G(d) and B3LYP/6-31G(d) levels
according to the tight-binding approximation, as in our previous work.(92, 97) The solvent
effects were considered with the implicit solvation model if necessary. Quantum chemistry
calculations can be performed using either the Gaussian or Orca package, both of which have been 
interfaced with BioTinter. 

In the ML layer, the relative positions of molecules are described through their internal 
coordinates (IC), the Coulomb Matrix (CM) and Smooth Overlap of Atomic Positions (SOAP) 
descriptors. For a comprehensive understanding of CM and SOAP descriptors, we recommend 
referring to existing literature.(141-144) Our analysis considers the effect of including or 
excluding hydrogen atoms in these molecular representations. Benchmark results (Table S1)
reveal that the presence of hydrogen atoms does not significantly affect our model's predictions. This
research primarily uses hydrogen-depleted CM descriptors, which are refined using a norm sorting 
technique. While the SOAP model introduces a more complex approach, it only slightly improves 
predictive accuracy. Therefore, our approach in BioTinter-1m prioritizes hydrogen-depleted CM 
descriptors for simulating DNA-protein systems. 
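
A hydrogen-depleted, norm-sorted Coulomb matrix can be sketched as below. The definition follows the standard Coulomb matrix from the literature (diagonal 0.5 Z^2.4, off-diagonal Z_i Z_j / R_ij); the function name, the choice to drop atoms with Z = 1, and the carbon monoxide test geometry are assumptions for illustration. The unit of length is whatever the input coordinates use, and any padding to a fixed size needed for machine learning is left out.

```python
import numpy as np


def sorted_coulomb_matrix(numbers, coords, drop_hydrogen=True):
    """Norm-sorted Coulomb matrix of a molecular fragment (hypothetical helper).

    numbers : atomic numbers Z, one per atom.
    coords  : Cartesian coordinates, shape (n_atoms, 3).
    """
    z = np.asarray(numbers, dtype=float)
    r = np.asarray(coords, dtype=float)
    if drop_hydrogen:
        keep = z > 1          # hydrogen-depleted: keep heavy atoms only
        z, r = z[keep], r[keep]
    n = len(z)
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    cm = np.empty((n, n))
    off = ~np.eye(n, dtype=bool)
    cm[off] = np.outer(z, z)[off] / dist[off]    # Z_i * Z_j / R_ij
    cm[np.diag_indices(n)] = 0.5 * z ** 2.4      # self-interaction term
    # Norm sorting: order rows and columns by descending row norm so that
    # the descriptor does not depend on the input atom order.
    order = np.argsort(-np.linalg.norm(cm, axis=1))
    return cm[order][:, order]


# Carbon monoxide given in two different atom orders.
cm_a = sorted_coulomb_matrix([6, 8], [[0.0, 0.0, 0.0], [0.0, 0.0, 1.128]])
cm_b = sorted_coulomb_matrix([8, 6], [[0.0, 0.0, 1.128], [0.0, 0.0, 0.0]])

# Water with hydrogens dropped reduces to the single oxygen diagonal entry.
cm_w = sorted_coulomb_matrix(
    [8, 1, 1], [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
)
```

Dropping hydrogens shrinks the matrix and lets heavy-atom-only experimental structures be used directly, at the cost of discarding protonation information from the descriptor.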

To illustrate the utility of the tight-binding (TB) model, we investigate the electronic 
structure variations in complexes involving Activator Protein 1 (AP-1) and Epstein-Barr Virus Zta 
transcription factors with their associated nucleic acids. The coordination of this sophisticated 
computational process is facilitated by the Snakemake workflow management system(145, 146). 
Calculations are monitored and streamlined using custom Python scripts developed for the
BioTinter packages, ensuring an automated and efficient workflow. Subsequent statistical analysis
of the results is performed using R scripts, providing a comprehensive assessment of the models'
predictive accuracy.


Results and Discussions 

TB parameters were calculated for thousands of AB conformers to analyze the specialization 
of amino acid or nucleic base distributions in realistic DNA-protein complexes. A complete tight 
binding Hamiltonian can be constructed for any DNA-protein complex by combining previously 
reported TB parameters from AA and BB libraries(92, 97). After collecting the AA, BB, and AB 
distributions, there are approximately one million conformers. This library is useful for describing 
how the conformation ensemble influences TB parameters within distinct protein structures. For 
instance, the TB parameters library allows for the extraction of explicit geometric correlation with 
the charge transfer couplings. It is commonly observed that the values of the charge transfer
couplings rapidly decay, decreasing to negligible levels at distances beyond 6.0 Å.

Figure 3. The PCA visualization of a spectrum of TB parameters involving HOMO and LUMO
orbitals. The visualization includes the four types of nucleic bases, which are the components of
any possible DNA sequence. Each confidence ellipse encloses the region expected to contain 95% of the
data points, based on their distribution along the principal components.


The principal component analysis (PCA) algorithm was used to categorize various AB 
parameters and correlate them with their physical properties. Figure 3 displays a two-dimensional 
(2D) plot from PCA that separates the data into distinct clusters. The color coding represents 
different amino acid characteristics: acidic (red), basic (blue), hydrophobic (purple), and polar 
(gray), highlighting the chemical nature of the residues as a pivotal factor in the variability of tight 
binding parameters. The numbers in the brackets on the PC1 and PC2 axes of the PCA plot 
represent the percentage of the variance in the dataset that is explained by each principal 
component. This plot also demonstrates the intrinsic distribution of parameters within each cluster, 
distinctly influenced by nucleobase type—adenine (ADE), thymine (THY), guanine (GUA), and 
cytosine (CYT). To ensure that the results do not depend on the choice of functional, TB parameters were calculated
using the Hartree-Fock (HF) method. For comparison, TB parameters were also calculated at
the B3LYP level, as shown in Figure S1. The PCA plots resulting from the B3LYP
calculations confirm the segregation of data into distinct clusters, as observed with HF
calculations. The spatial arrangement of TB parameters in AB conformers is primarily determined
by the chemical nature and charge state of the amino acid residues. Secondary factors include the
type of nucleobase and the choice of DFT functional.
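
The kind of PCA projection discussed here can be sketched with scikit-learn. The data below are random numbers arranged into two artificial groups, not TB parameters from the library; the group labels, the six parameter columns, and the standardization step are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 conformers with 6 TB parameters each (for example
# HOMO/LUMO on-site energies and couplings), split into two groups.
rng = np.random.default_rng(1)
params = rng.normal(size=(300, 6))
params[:150] += 2.0                      # shift one group to mimic clustering
labels = np.array(["basic"] * 150 + ["acidic"] * 150)

# Standardize each parameter, then project onto the first two components.
scaled = StandardScaler().fit_transform(params)
pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

# Percentage of variance explained by PC1 and PC2, the numbers usually
# printed in brackets on the axes of a PCA plot.
explained = 100.0 * pca.explained_variance_ratio_

# Separation of the two groups along PC1.
gap = abs(scores[labels == "basic", 0].mean() - scores[labels == "acidic", 0].mean())
```

Plotting `scores[:, 0]` against `scores[:, 1]` with one color per label gives the two-dimensional cluster view, with `explained` supplying the axis annotations.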

Figure 4. Comparative analysis of the average hopping integrals for HOMO and LUMO across
AB conformers. The absolute values are used. Histidine is represented in its three protonation
states: HID, HIE, and HIP. The x-axis label uses color coding to differentiate amino acids based
on their chemical properties, including hydrophobicity, polarity, acidity, and basicity.


Figure 4 shows a detailed analysis of the average hopping integrals between each of the four 
nucleobases and twenty standard amino acids. This figure also highlights the varying interaction 
strengths of histidine in its three protonation states: HID, HIE, and HIP, which reflect the different 
coupling strengths in various biochemical environments. The charge transfer integrals between 
nucleobases and various amino acids exhibit significant differences. Each nucleobase has its own 
preferred interacting amino acid with specific charge transfer couplings. This is fundamental in 
comprehending the dynamics of DNA-protein interactions at the electronic and molecular levels. 
Aromatic amino acids, such as histidine, phenylalanine, tryptophan, and tyrosine, generally exhibit 
significant charge transfer couplings. This phenomenon may be caused by either the π-π
interaction or the C-H···π interaction, which could significantly enhance the possibility of electron
transfer. The average on-site energy difference for such AB conformers is often within 1.0 eV or 
even lower. Other residues, such as serine (SER), cysteine (CYS), and methionine (MET), may 
also have slightly larger couplings involving the oxygen or sulfur atom in the side-chain. The on- 
site energy differences are approximately 1.0 eV for MET and CYS involving the sulfur atom, 
while the SER involving the oxygen atom has an on-site energy difference as large as 2.0-3.0 eV. 
The couplings for ILE/ADE are relatively large for the LUMO orbitals. However, their on-site 
energy difference is as large as 4.0 eV. Similar findings are observed with TB parameters 
calculated at the B3LYP level (Figure S2). Averaged over all amino acids, the HOMO charge transfer
integrals are largest for THY (0.026 eV) and ADE (0.023 eV), followed by CYT (0.021 eV) and GUA
(0.018 eV). A similar trend is observed for the LUMO orbitals, where the
charge transfer integrals are larger for ADE (0.054 eV) and THY (0.051 eV), and
smaller for CYT (0.050 eV) and GUA (0.033 eV).

Charge transfer couplings are reported to exhibit high sensitivity to the structural orientation 
of molecular fragments. Figure S3 shows several AB structure contacts, where each cluster in the 
same AB pair has a significantly different distribution. The population of charge transfer
couplings is “encoded” in various modes of geometric contact, i.e. the π-π interactions, C-H···π
interactions, hydrogen bonds or van der Waals contacts. The orientation of aromatic molecules
can either enhance or diminish charge transfer couplings. The chemical diversity and specificity of
various AB conformers can exhibit subtle differences in molecular structure or electronic
properties, even within seemingly homogeneous groups. Note that the charge transfer couplings
are not symmetric due to the inhomogeneity of DNA-protein structures, and the distribution of one
type of amino acid in the frame of another reference nucleobase residue type is distinct.

Figure 5. The predictive performance of the machine learning algorithm is evaluated based on
two types of molecular descriptors: (a) intermolecular coordinates and (b) hydrogen-depleted
Coulomb matrices. The color-coded data points represent different nucleobases.


As possible structural changes will influence the electrical properties of DNA-protein
complexes, a reasonable description of transfer couplings beyond empirical formulas is
necessary. Figure 5 shows the predictive performance of the RF model for the TB parameters of
arbitrary conformers, compared against the TB parameters library. The intermolecular coordinate system
uses the distance (r), planar angles (θ, φ), and dihedral angles (γ), providing a detailed set of
molecular descriptors that encapsulate the spatial orientation of the molecules. The Coulomb
matrix leverages atomic numbers (Z) and interatomic distances (R). This approach highlights how
electronic properties are influenced by atomic identities and their spatial relationships. It
emphasizes the importance of both atomic composition and geometric arrangements in
determining the electronic characteristics of molecules. These descriptors are essential to machine
learning models for predicting molecular properties. The correlation between actual and predicted
on-site energy is very robust, with the line of best fit closely aligning with the ideal. The internal
coordinates are successful only in predicting the on-site energy and often fail to predict
the charge transfer couplings. This suggests that the internal molecular geometries are also very
important.

We trained the model using an 8:2 training/test ratio. The model thereby combines
accuracy and efficiency in constructing TB Hamiltonians for realistic DNA, protein or DNA-protein
complexes. To facilitate the use of experimental DNA and protein structures, we also compared the
molecular descriptors with and without hydrogen atoms, and the results are shown in Table S1.
Because prediction errors in certain scenarios could lead to outliers, we have established
criteria for identifying similarity between descriptors. These criteria include an average distance of
less than 0.1 Å between two descriptors treated as vectors, and an angle of less than 30 degrees
between these multidimensional vectors (of more than three dimensions).
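
One possible reading of these similarity criteria is sketched below. The text does not specify how the average distance is computed, so the mean absolute element-wise difference used here is an assumption, as are the function name and the example vectors; only the two thresholds (0.1 and 30 degrees) come from the text.

```python
import numpy as np


def is_similar(d1, d2, max_mean_dist=0.1, max_angle_deg=30.0):
    """Check whether two descriptors, flattened to vectors, are similar.

    Assumed criteria: the mean absolute element-wise difference is below
    `max_mean_dist`, and the angle between the vectors is below
    `max_angle_deg`. Both vectors must be non-zero.
    """
    a = np.asarray(d1, dtype=float).ravel()
    b = np.asarray(d2, dtype=float).ravel()
    mean_dist = np.mean(np.abs(a - b))
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against rounding slightly outside [-1, 1].
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return bool(mean_dist < max_mean_dist and angle < max_angle_deg)


close_pair = is_similar([1.0, 2.0, 3.0, 4.0], [1.05, 2.0, 2.95, 4.0])
far_pair = is_similar([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])
```

A new conformer whose descriptor fails this check against its nearest library entries would be a natural candidate for an on-the-fly ab initio calculation rather than a machine learning prediction.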

Before examining realistic systems, we first conducted an evaluation of the performance of 
our TB parameters. Figure S4 compares the HOMO/LUMO gap for one thousand randomly generated
dimer and trimer conformers involving nucleobases or amino acids. The results
indicate that our prediction algorithm achieves deviations of 0.1~0.2 eV, which is quite successful
for such a simplified TB model. The randomly generated dimers and trimers for AA configurations
were derived from existing PDB databases, BB structures were partly derived from PDB and 
partly generated by x3DNA, while mixed AB structures were mainly derived from dimer and 
trimer structures at transcription factor binding sites. The insights gained from these benchmarks 
can be used to optimize computational strategies for modeling biological systems. In addition, the 
HOMO/LUMO gaps for nucleobases typically reflect their electronic properties and can vary 
depending on the computational method used for calculations.(147-153) The HOMO/LUMO gap
calculated at the HF level (9~10 eV) is much larger than experimental values, while B3LYP
provides reasonable results (4.0~5.0 eV). The TB parameters derived from B3LYP calculations
are therefore used for realistic DNA-protein complexes in the following discussions.

The applicability of the BioTinter-1m model was evaluated by studying transcription factors, 
which are key proteins in the regulation of gene expression. They modulate the activation and 
repression of specific genes by binding to adjacent DNA sequences. Each transcription factor 
recognizes and binds to a specific sequence in the DNA alphabet (A, C, G, and T) known as a 
consensus site. The Jun protein is an AP-1 protein that recognizes two versions of a 7-base pair 
response element (Figure 6b): either TRE (5’-TGAGTCA-3’), with PDB ID: 2H7H, or meTRE 
(5’-MGAGTCA-3’), where M = 5-methylcytosine, with PDB ID: 5T01. These elements differ only at 
the first base pair (bp), with T:A in TRE and 5mC:G (M:G) in meTRE. c-Jun can form both 
homodimers and heterodimers. Epstein-Barr Virus (EBV) Zta is a key transcription factor of the 
viral lytic cycle that is homologous to AP-1. The EBV viral genome is initially unmethylated but 
becomes highly methylated during the latent stage of the viral cycle.(154, 155) Figure 6a 
illustrates the amino acid sequences of the human Jun protein, the Epstein-Barr virus Zta protein, 
and a mutant variant of the Zta protein (S186A), referred to as Zta* in this study. Zta* is 
designed to mimic the AP-1 protein in its interaction with the TRE DNA element, with the 
comparison based on the crystal structure identified by PDB ID: 2C9L. Both human AP-1 and EBV Zta 
are bZIP family transcription factors that bind the classical TRE. They also recognize methylated 
cytosine residues within different sequence contexts.(156, 157) The TB parameter library is large 
enough to represent most possible AA, BB and AB conformers found in realistic DNA and protein 
structures, with prediction failures under 5% across different systems. The introduction of the 
BioTinter-10m model, encompassing ten million conformers, is anticipated to reduce prediction 
errors to less than 0.1%. This process utilizes both the extensive TB parameter library and a 
minimal set of on-the-fly ab initio calculations, ensuring the robustness and accuracy of our 
predictions. 

Figure 6: Comparative analysis of the protein-DNA interaction complexes. (a) Sequence alignment 
of the proteins Jun, Zta and Zta* with highlighted differences. (b) Sequence alignment of the DNA 
elements TRE and meTRE. Visualization of (c) Jun/Jun binding to TRE, (d) Jun/Jun binding to 
meTRE, and (e) Zta*/Zta* binding to TRE, with the DNA-protein interface marked by a red circle, 
and the corresponding charge transfer network analysis for the HOMO and LUMO orbitals. The size of 
a network node is related to its degree within the network. 


Figure 6 presents a comprehensive view of the interaction between transcription factors and 
DNA. Each three-dimensional structure is accompanied by a schematic diagram of the DNA-protein 
interface, highlighting the interactions between amino acids and nucleotides, and is complemented 
by a graphical representation of the charge transfer network. The electronic Hamiltonian of 
biological molecules diverges from the simple tridiagonal matrix characteristic of linear molecules 
because of the complex stacking arrangements of nucleobases and amino acids found in actual 
DNA-protein structures. In prior research, the concept of a knowledge graph was introduced as a 
visualization tool for the TB Hamiltonian of biomolecules. To construct the DNA-protein charge 
transfer network, each residue is represented by a vertex in the graph, and each edge represents 
the strength of the charge transfer coupling between a pair of residues. To preserve geometric 
features similar to those of the TF molecules, we use the Kamada-Kawai layout to generate the 
complex network. The Kamada-Kawai algorithm is a force-directed graph layout algorithm that 
emphasizes the consistency between the geometric distances and the graph-theoretic distances 
between nodes.(158) The threshold for significant charge transfer coupling is set to 0.001 eV in 
this work. 
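
The graph construction described above can be sketched as follows. This is a minimal 
illustration, not the authors' implementation: the function names, residue labels, and coupling 
values are hypothetical, the input is assumed to be one effective coupling per residue pair (in 
eV), and a coupling is assumed to be kept when its magnitude is at or above the 0.001 eV threshold. 

```python
def build_ct_network(residues, couplings, threshold=0.001):
    """Build an undirected charge transfer graph as an adjacency dictionary.

    residues  : list of residue labels (graph vertices)
    couplings : symmetric nested list, couplings[i][j] in eV (assumed input format)
    threshold : minimum |coupling| in eV for an edge to be kept
    """
    graph = {name: {} for name in residues}
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            strength = abs(couplings[i][j])
            if strength >= threshold:
                graph[residues[i]][residues[j]] = strength
                graph[residues[j]][residues[i]] = strength
    return graph


def node_degrees(graph):
    """Number of edges per vertex; the Figure 6 caption relates node size to degree."""
    return {name: len(neighbors) for name, neighbors in graph.items()}


# Invented example with two amino acids and two nucleobases (couplings in eV).
residues = ["ARG", "SER", "GUA", "THY"]
couplings = [
    [0.0,    0.0200, 0.0050, 0.0020],
    [0.0200, 0.0,    0.0004, 0.0003],
    [0.0050, 0.0004, 0.0,    0.0450],
    [0.0020, 0.0003, 0.0450, 0.0],
]
network = build_ct_network(residues, couplings)
print(node_degrees(network))
```

The resulting edges could then be handed to a graph library for drawing. For instance, NetworkX 
provides a `kamada_kawai_layout` function that returns node positions for a graph; it is not used 
here so that the sketch stays free of third-party dependencies. 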
Methylation can cause significant changes in DNA-protein interactions, which may result in 
notable alterations in gene expression patterns. Variations in nucleic acid sequences can have a 
significant impact on the distribution of TB Hamiltonian matrix elements at the nucleic 
acid-protein interface. This is demonstrated in the binding of the Jun/Jun protein to the TRE and 
meTRE sequences, as shown in Figures 6c and 6d. Similarly, alterations in protein sequences impact 
both the protein termini and the nucleic acid-protein interface. This is exemplified by the 
interactions of Jun and the Zta* mutant protein with the TRE sequence in Figures 6c and 6e. The 
charge transfer networks of these DNA-protein complexes illustrate the intricate pathways of 
electronic interactions within the binding interface. 

Figure 7. Comparative visualization of molecular orbitals across energy levels and the 
corresponding HOMO/LUMO distributions with complex network representation for different 
transcription factor-DNA complexes in (a) vacuum and (b) implicit water solvent. The coloring 
scheme is the same as in Figure 6. 


After constructing the TB Hamiltonian matrix for a DNA-protein complex using the BioTinter-1m 
model, direct diagonalization is applied to calculate various electronic structure properties. 
Currently, the HOMO and LUMO of each site are used as the basis functions; additional frontier 
orbitals could easily be included as basis functions in our model. As shown in Figure 7, the 
HOMO/LUMO gap in water solvent is larger than that in vacuum, which is quite similar to the 
results of DFT calculations for model systems. The frontier orbitals, especially the HOMO and 
LUMO, are highlighted with complex network methodologies (Figure 7). In the network, a residue 
with large coefficients in a given molecular orbital is drawn with a larger node size. The 
frontier orbitals are generally localized on a few amino acids and nucleobases. The distance 
between nodes is related to their sequence distance: adjacent nodes in the network are relatively 
close in secondary sequence structures. 
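
The diagonalization step and the per-residue orbital weights can be sketched with NumPy. This is a 
toy example under stated assumptions, not the BioTinter-1m code: the function names are 
hypothetical, the on-site energies and couplings are invented, the basis is ordered as 
[HOMO, LUMO] for each residue in turn, and each residue is assumed to contribute one doubly 
occupied level, so the number of occupied orbitals equals the number of residues. 

```python
import numpy as np


def build_tb_hamiltonian(site_energies, couplings):
    """Assemble a symmetric TB Hamiltonian (eV).

    site_energies : on-site energies ordered [HOMO_0, LUMO_0, HOMO_1, LUMO_1, ...]
    couplings     : dict {(i, j): coupling in eV} between basis functions i and j
    """
    h = np.diag(np.asarray(site_energies, dtype=float))
    for (i, j), t in couplings.items():
        h[i, j] = t
        h[j, i] = t
    return h


def frontier_orbitals(h, n_occupied):
    """Diagonalize h and return (E_HOMO, E_LUMO, c_HOMO, c_LUMO)."""
    energies, coeffs = np.linalg.eigh(h)  # eigenvalues come back in ascending order
    homo, lumo = n_occupied - 1, n_occupied
    return energies[homo], energies[lumo], coeffs[:, homo], coeffs[:, lumo]


def residue_weights(orbital):
    """Sum the squared coefficients over the two basis functions of each residue."""
    return (np.asarray(orbital) ** 2).reshape(-1, 2).sum(axis=1)


# Invented three-residue system (all values in eV).
site_energies = [-6.0, -1.5, -5.5, -1.0, -6.2, -1.8]
couplings = {(0, 2): 0.05, (1, 3): 0.04, (2, 4): 0.03, (3, 5): 0.02}
h = build_tb_hamiltonian(site_energies, couplings)
e_homo, e_lumo, c_homo, c_lumo = frontier_orbitals(h, n_occupied=3)
print("HOMO/LUMO gap (eV):", e_lumo - e_homo)
print("HOMO weight per residue:", residue_weights(c_homo))
print("LUMO weight per residue:", residue_weights(c_lumo))
```

The per-residue weights are the kind of quantity that could be mapped to node sizes in a network 
drawing, since they show on which residues a frontier orbital is concentrated. 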

Despite its simplifications, the complex network analysis places the electronic structure 
variants on an equal footing. The HOMO and LUMO are generally much more dispersed in the implicit 
solvent model than in the vacuum model. The frontier orbitals have very distinct features for 
each kind of DNA-protein complex. It is interesting to note that the structurally important 
residues identified as hubs are located at the DNA-protein interface or at the boundary residues 
of the DNA chain. In the computational model, the DNA chain generally does not exceed twenty 
residues, which may lead to boundary residues contributing to the frontier molecular orbitals. For 
the Jun/Jun:meTRE complex, the HOMO/LUMO orbitals are primarily distributed across amino acids and 
nucleobases that are relatively distant from each other. This distribution could indicate that the 
electronic structure of the complex facilitates charge transfer over long distances, a phenomenon 
that is crucial for many biological processes, such as signal transduction and energy transfer. 
This is consistent with the report that methylation may cause significant changes to the 
photostability of nucleic acids, resulting in these sites becoming mutational hotspots for 
diseases such as skin cancer. This analysis helps to unravel the richness of electronic structure 
variants in realistic DNA-binding protein complexes, which would evolve with fluctuating 
biomolecular structures. 

Figure 8. Density of states (DOS) plots for the DNA-protein complexes, contrasting calculations 
in vacuum (black line) with those in a water solvent environment (red line). (a) The Jun/Jun 
homodimer interacting with the TRE response element. (b) The Jun/Jun homodimer with the meTRE 
response element. (c) The Zta*/Zta* homodimer with the TRE element. 


Figure 8 presents a comparative analysis of the electronic structures of the DNA-protein 
complexes through their density of states (DOS) under vacuum and aqueous conditions. The 
electronic properties are significantly influenced by solvent effects, which shift and broaden 
energy states around the HOMO and LUMO levels, as detailed in Figure 7. This demonstrates the role 
of the solvent in stabilizing electronic states. The peaks in the DOS become more pronounced and 
concentrated, and there are alterations in peak positions and substantial changes in peak 
intensities. These changes underscore the critical impact of the solvent on the electronic 
properties at the DNA-protein interface, where the HOMO and LUMO are predominantly associated with 
interfacial residues. The Mulliken charges of each residue were also calculated. Figure S5 
displays scatter plots of the Mulliken charge populations for the DNA-protein complexes in both 
vacuum and aqueous environments. A consistent pattern emerges across the Jun/Jun:TRE, 
Jun/Jun:meTRE, and Zta*/Zta*:TRE complexes: the distribution of charges on amino acids and 
nucleobases appears relatively stable in water but exhibits subtle shifts in vacuum. 
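
A DOS curve of the kind compared in Figure 8 can be obtained from TB eigenvalues by broadening 
each level. The sketch below assumes Gaussian broadening with an arbitrary width of 0.1 eV; the 
source does not state the broadening scheme, and the function name, the energy grid, and the 
eigenvalues are invented for illustration. 

```python
import numpy as np


def gaussian_dos(eigenvalues, energy_grid, sigma=0.1):
    """Density of states as a sum of unit-area Gaussians centred on each eigenvalue.

    eigenvalues : orbital energies in eV
    energy_grid : energies (eV) at which the DOS is evaluated
    sigma       : Gaussian width in eV (assumed value, not from the source)
    """
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    energy_grid = np.asarray(energy_grid, dtype=float)
    diff = energy_grid[:, None] - eigenvalues[None, :]
    norm = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return norm * np.exp(-0.5 * (diff / sigma) ** 2).sum(axis=1)


# Invented eigenvalues (eV) standing in for a diagonalized TB Hamiltonian.
levels = [-6.2, -6.0, -5.5, -1.8, -1.5, -1.0]
grid = np.linspace(-10.0, 2.0, 2401)
dos = gaussian_dos(levels, grid)

# Each Gaussian has unit area, so the integrated DOS approximates the number of levels.
step = grid[1] - grid[0]
print("integrated DOS:", dos.sum() * step)
```

Evaluating the same function on the vacuum and solvent eigenvalues over a common grid would give 
two curves that can be overlaid directly. 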


Conclusions 

Protein-DNA interactions are essential for various cellular processes such as replication, 
transcription, recombination, and DNA repair. Here, a library of tight-binding (TB) parameters 
containing millions of conformers has been derived for amino acids and nucleobases. Machine 
learning methods were used to predict TB parameters for arbitrary fragments of amino acids and 
nucleobases. The electronic structure variants of the AP-1 and Epstein-Barr virus Zta 
transcription factors were studied in relation to their respective protein sequences and DNA 
binding sequences. The direct diagonalization scheme was used to obtain the tight-binding 
molecular orbitals. Our results, including the DOS and frontier molecular orbitals, demonstrate 
significant variations in electronic structure as the protein or DNA sequence changes. This work 
presents a cost-effective computational tool for analyzing the electronic structure of DNA-protein 
structures. These insights help to illustrate the complex interdependence of structure, sequence, 
and electronic properties in the regulation of gene expression. 

Transcription factors that bind to DNA modulate gene expression, with the stability and reactivity 
of their interactions elucidated by eigenvalues derived from the tight-binding model. Visualization 
of these interactions reveals the Highest Occupied Molecular Orbital (HOMO) and the Lowest 
Unoccupied Molecular Orbital (LUMO), the gap between which determines the reactivity and 
stability of the molecular complex. This approach advances our understanding of gene regulation 
by revealing the dynamics of charge transfer and electronic states within transcription factor-DNA 
complexes. 


